Vcauxbrisebo/vm gpu support by vince-brisebois · Pull Request #851 · NVIDIA/OpenShell

vince-brisebois · 2026-04-15T18:07:54Z

Summary

Adds VFIO GPU passthrough support to openshell-vm using cloud-hypervisor as a second VMM backend alongside libkrun. Includes a full GPU bind/unbind lifecycle with safety checks, nvidia driver deadlock hardening (subprocess isolation with timeout, pre-unbind module cleanup, post-timeout verification), and an RAII guard that restores the original driver on exit.
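
The "subprocess isolation with timeout" idea can be sketched roughly as below. This is an illustrative sketch using only the Rust standard library, not the code from this PR: `run_with_timeout`, `Outcome`, and the 20 ms poll interval are hypothetical, and the best-effort `kill()` before dropping the child is an assumption. The property it tries to show is that the parent polls with `try_wait()` and never calls a blocking `wait()`, so a child stuck in uninterruptible sleep cannot hang the parent.

```rust
use std::io;
use std::process::{Child, Command, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Result of a bounded subprocess run (hypothetical type for this sketch).
#[derive(Debug)]
pub enum Outcome {
    /// The child exited on its own before the deadline.
    Exited(ExitStatus),
    /// The deadline passed; the child was abandoned without waiting on it.
    TimedOut,
}

/// Spawn `cmd` and poll for completion until `timeout` elapses.
pub fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<Outcome> {
    let mut child: Child = cmd.spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait() is non-blocking: it returns Ok(None) while the child runs.
        if let Some(status) = child.try_wait()? {
            return Ok(Outcome::Exited(status));
        }
        if Instant::now() >= deadline {
            // Assumption: send a best-effort kill, ignoring any error.
            let _ = child.kill();
            // Dropping a Child neither waits for it nor kills it, so the
            // parent cannot block here even if the child never exits.
            drop(child);
            return Ok(Outcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(20));
    }
}
```

A caller would then treat `Outcome::TimedOut` as "unknown" rather than "failed" and re-check the device state, which is the post-timeout verification step the summary describes.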

Related Issue

N/A

Changes

VMM backend abstraction: Extract VmBackend trait with LibkrunBackend and CloudHypervisorBackend implementations; auto-select CHV when --gpu is set, reject --backend libkrun --gpu at the CLI level
GPU bind lifecycle (gpu_passthrough.rs): Probe sysfs for NVIDIA GPUs, check VFIO/IOMMU readiness, fail-closed safety checks (display outputs, /dev/nvidia* handles, IOMMU groups, VFIO modules, permissions), RAII GpuBindGuard for driver restoration
nvidia unbind deadlock hardening: Pre-unbind prep (disable persistence mode, unload nvidia_uvm/nvidia_drm/nvidia_modeset), all sysfs writes and prep commands in subprocesses with timeout (10s/15s), drop(child) without wait() to prevent parent D-state, post-timeout verification that continues if device is actually unbound
Cloud-hypervisor backend: Direct kernel boot with virtiofsd, TAP networking with NAT/port forwarding, vsock exec bridge, ACPI shutdown wrapper for --exec mode
Kernel kconfig: Add CONFIG_VIRTIO_PCI, CONFIG_SERIAL_8250, CONFIG_SERIAL_8250_CONSOLE, CONFIG_ACPI, CONFIG_PCI, CONFIG_PCI_MSI, CONFIG_DRM, CONFIG_MODULES, CONFIG_MODULE_UNLOAD
Guest rootfs: NVIDIA driver install support, device plugin and runtime class manifests, init script GPU detection and module loading
CI: gpu-ci.yml workflow on self-hosted GPU runners with OPENSHELL_VM_GPU_E2E=1
Architecture docs: Update custom-vm-runtime.md for dual-backend architecture, add vm-gpu-passthrough.md, add both to architecture/README.md index
Pre-commit fixes: rustfmt corrections, clippy ptr_arg fix in build.rs, test race condition fix in image.rs
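
The backend selection rule from the first item above (auto-select CHV when --gpu is set, reject --backend libkrun --gpu) could look something like the sketch below. The `Backend` enum and `select_backend` function are hypothetical names, not the PR's `VmBackend` trait, and defaulting to libkrun when neither flag is given is an assumption.

```rust
/// VMM backends named in the PR description (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Libkrun,
    CloudHypervisor,
}

/// Resolve the backend from an optional explicit `--backend` value and the `--gpu` flag.
pub fn select_backend(requested: Option<Backend>, gpu: bool) -> Result<Backend, String> {
    match (requested, gpu) {
        // GPU passthrough is only described for cloud-hypervisor, so this
        // combination is rejected up front instead of failing at boot.
        (Some(Backend::Libkrun), true) => {
            Err("--gpu is not supported with --backend libkrun".to_string())
        }
        // Any other explicit choice is honored as given.
        (Some(backend), _) => Ok(backend),
        // No explicit backend: --gpu implies cloud-hypervisor.
        (None, true) => Ok(Backend::CloudHypervisor),
        // Assumption: libkrun stays the default for non-GPU runs.
        (None, false) => Ok(Backend::Libkrun),
    }
}
```

Keeping the rule in one pure function makes the rejected combination easy to cover with a unit test, independent of the CLI parser.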

Testing

mise run pre-commit passes
Unit tests added/updated
E2E tests added/updated (if applicable)

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

copy-pr-bot · 2026-04-15T18:07:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

drew

this looks good. couple questions around deps and some approaches to bundling.

once those are resolved we can land this pr, or try to apply it on top of #858 which starts the new driver implementation for the vm.

drew · 2026-04-17T04:19:18Z

+  vm:
+    name: VM Checks
+    runs-on: build-amd64


maybe for now we can leave these out. we'll add tests once we get things over to the driver architecture.

drew · 2026-04-17T04:21:44Z

same as above. i think we stash this. once we land the initial implementation let's add the vm to our e2e tests.

drew · 2026-04-17T04:53:07Z

nit: we don't want to check this in. i captured the plan for posterity under OS-53.

drew · 2026-04-17T04:56:09Z

-#   ./build-rootfs.sh [--base] [--arch aarch64|x86_64] [output_dir]
+#   ./build-rootfs.sh [--base] [--gpu] [--arch aarch64|x86_64] [output_dir]


Will we need to publish two rootfs images?

drew · 2026-04-17T04:57:14Z

plan here is to ship a single kernel for both gpu and cpu? (i think this makes sense for now, just checking)

drew · 2026-04-17T04:58:05Z

is this change necessary?

drew · 2026-04-17T04:58:21Z

we want to avoid adding the whole vm dep here if possible. let's talk about this more since i don't know if it's straightforward to remove or not.

drew · 2026-04-17T05:05:16Z

Also ran this through codex review, its feedback looks pretty good:

The patch introduces two functional blockers in the new GPU path: the gateway CLI drops the VFIO bind guard as soon as deployment returns, and openshell-vm --gpu still boots the non-GPU rootfs by default. It also leaves host IPv4 forwarding enabled after cloud-hypervisor VMs stop.

Full review comments:

[P1] Keep the VFIO bind alive for the gateway VM lifetime — /Users/anewberry/dev/openshell/crates/openshell-cli/src/run.rs:1438-1438
On the local microVM gateway path (OPENSHELL_GATEWAY_BACKEND=vm), prepare_gateway_deploy_gpu returns a GpuBindGuard that is scoped only to gateway_admin_deploy. This function returns as soon as the gateway becomes healthy, so the guard drops immediately and rebinds the passed-through GPU back to the host driver while the gateway VM is still running. In practice, openshell gateway start --gpu tears down its own VFIO assignment right after startup instead of keeping the device attached for the VM lifetime.
[P1] Select a GPU-capable rootfs when --gpu is requested — /Users/anewberry/dev/openshell/crates/openshell-vm/src/main.rs:236-238
When --gpu is used without an explicit --rootfs, this still resolves the normal named/embedded rootfs. The new guest init path only succeeds on a rootfs built with build-rootfs.sh --gpu (it now requires NVIDIA userspace tools and /opt/openshell/gpu-manifests), so the default packaged openshell-vm --gpu flow boots the wrong image and exits during init. Today the GPU path only works if the caller manually points --rootfs at a separately built GPU image.
[P2] Restore host IPv4 forwarding after CHV teardown — /Users/anewberry/dev/openshell/crates/openshell-vm/src/backend/cloud_hypervisor.rs:882-883
Any Linux cloud-hypervisor launch with TAP networking (--backend cloud-hypervisor or the automatic --gpu path) now writes /proc/sys/net/ipv4/ip_forward=1, but teardown_chv_host_networking only removes the iptables rules. After the VM exits, the host remains in forwarding mode until the user changes it back or reboots, which is a persistent host-networking side effect introduced by this command.
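
One possible shape for the [P2] fix is a small RAII guard that remembers the previous value before overwriting it and writes it back on drop. The sketch below is an assumption about how that could be done, not code from this PR: `SysctlGuard` is a hypothetical name, and it takes an arbitrary file path so it can be exercised without touching /proc.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Restores a file's previous contents when dropped (hypothetical helper).
pub struct SysctlGuard {
    path: PathBuf,
    previous: String,
}

impl SysctlGuard {
    /// Record the current contents of `path`, then overwrite them with `value`.
    pub fn set(path: &Path, value: &str) -> io::Result<Self> {
        let previous = fs::read_to_string(path)?;
        fs::write(path, value)?;
        Ok(Self {
            path: path.to_path_buf(),
            previous,
        })
    }
}

impl Drop for SysctlGuard {
    fn drop(&mut self) {
        // Best effort: Drop cannot return an error, so a failed restore is ignored here.
        let _ = fs::write(&self.path, &self.previous);
    }
}
```

In the CHV backend this would presumably wrap /proc/sys/net/ipv4/ip_forward and be held alongside the other teardown state. The scoping concern in the first [P1] item applies here too: a guard like this only helps if its owner lives as long as the VM does.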

vince-brisebois added 2 commits April 15, 2026 18:19

GPU support design and implementation plan

5fb51ac

Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>

feat(vm): add GPU passthrough with cloud-hypervisor backend and nvidi…

199a712

…a unbind hardening Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>

vince-brisebois force-pushed the vcauxbrisebo/vm-gpu-support branch from 6ab54d1 to 199a712 on April 15, 2026 18:25

vince-brisebois wants to merge 2 commits into main from vcauxbrisebo/vm-gpu-support
